Record Location and Reconfiguration in Unstructured Multiple-Record Web Documents
Abstract
Record extraction from data-rich, unstructured, multiple-record Web documents works well [8], but only if the text for each record can be located and isolated. Although some multiple-record Web documents present records as contiguous, delineated chunks of text (which can thus be located and isolated [9]), many do not. When some values of textual records are factored out, are split unnaturally across boundaries, are joined unnaturally within boundaries, or are linked by off-page connectors, or when desired records are interspersed with records that are not of interest, it is difficult to automatically cull records and piece values together into clean, delineated chunks of text that each represent a single record of interest. In this paper we attack this problem and propose an algorithm that finds and, if necessary, rearranges records in an HTML document by attempting to maximize a record-recognition heuristic with respect to a given application ontology. Tests we conducted show that this technique properly locates and reconfigures records for all classified types of rearrangements, both for artificial and for actual multiple-record Web documents.
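The idea of choosing a record configuration by maximizing a recognition heuristic can be illustrated with a small sketch. This is not the paper's algorithm: the toy regex "ontology", the scoring function, the separator list, and the single rearrangement rule (distributing a factored-out leading value) are all assumptions made here for illustration only.

```python
import re

# Hypothetical toy "ontology": one recognizer per field that is
# expected to appear exactly once in each record (an assumption).
ONTOLOGY = {
    "year": re.compile(r"\b(?:19|20)\d{2}\b"),
    "price": re.compile(r"\$\d[\d,]*"),
    "phone": re.compile(r"\b\d{3}-\d{4}\b"),
}


def record_score(chunk):
    """Fraction of ontology fields that occur exactly once in the chunk."""
    hits = sum(1 for rx in ONTOLOGY.values() if len(rx.findall(chunk)) == 1)
    return hits / len(ONTOLOGY)


def configuration_score(chunks):
    """Mean per-chunk score; 0.0 for an empty configuration."""
    if not chunks:
        return 0.0
    return sum(record_score(c) for c in chunks) / len(chunks)


def split_on(text, sep):
    """Split text on a candidate separator, dropping empty pieces."""
    return [c.strip() for c in text.split(sep) if c.strip()]


def distribute_factored(chunks):
    """Treat the first chunk as a factored-out value and prepend it to the rest."""
    if len(chunks) < 2:
        return chunks
    head, rest = chunks[0], chunks[1:]
    return [f"{head} {c}" for c in rest]


def best_configuration(text, separators=("<hr>", "<br>")):
    """Return the candidate configuration with the highest heuristic score."""
    candidates = []
    for sep in separators:
        chunks = split_on(text, sep)
        candidates.append(chunks)                       # records as they appear
        candidates.append(distribute_factored(chunks))  # one possible rearrangement
    return max(candidates, key=configuration_score)


page = "Call 555-1234<hr>1995 Honda $4,500<hr>1998 Ford $7,200"
for record in best_configuration(page):
    print(record)
```

In this toy input the phone number is factored out of the two car listings, so the plain split scores lower than the configuration that copies the phone number into each listing. A real system would need recognizers derived from the application ontology and a much richer set of candidate rearrangements.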
Similar resources
Locating and Reconfiguring Records in Unstructured Multiple-Record Web Documents
Record extraction from data-rich, unstructured, multiple-record Web documents works well [9], but only if the text for each record can be located and isolated. Although some multiple-record Web documents present records as contiguous, delineated chunks of text (which can thus be located and isolated [10]), many do not. When some values of textual records are factored out, are split unnaturally a...
Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages
Electronically available data on the Web is exploding at an ever increasing pace. Much of this data is unstructured, which makes searching hard and traditional database querying impossible. Many Web documents, however, contain an abundance of recognizable constants that together describe the essence of a document’s content. For these kinds of data-rich, multiple-record documents (e.g. advertise...
Recognizing Ontology-Applicable Multiple-Record Web Documents
Automatically recognizing which Web documents are “of interest” for some specified application is non-trivial. As a step toward solving this problem, we propose a technique for recognizing which multiple-record Web documents apply to an ontologically specified application. Given the values and kinds of values recognized by an ontological specification in an unstructured Web document, we apply t...
Adaptive Approximate Record Matching
Typographical data-entry errors and incomplete documents produce imperfect records in real-world databases. These errors generate distinct records that belong to the same entity. The aim of Approximate Record Matching is to find the multiple records that belong to one entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...
Geographic Focus Detection for Web Documents using Multiple Location Taggers
Being able to identify locations associated with a Web resource is essential for providing location-based Web applications. However, geographical information in Web documents is rarely supplied in a machine-readable way and is therefore not easily discoverable. As a consequence, it is necessary to extract geographical keywords from Web documents and to associate locations with them. This method is c...